Optimizing Large-Scale Semi-Naïve Datalog Evaluation in Hadoop

Authors

Marianne Shaw

Paraschos Koutris

Bill Howe

Dan Suciu

Abstract

We explore the design and implementation of a scalable Datalog system using Hadoop as the underlying runtime system. Observing that several successful projects provide a relational algebra-based programming interface to Hadoop, we argue that a natural extension is to add recursion to support scalable social network analysis, internet traffic analysis, and general graph queries. We implement semi-naive evaluation in Hadoop, then apply a series of optimizations, ranging from fundamental changes to the Hadoop infrastructure to basic configuration guidelines, that collectively offer a 10x improvement in our experiments. This work lays the foundation for a more comprehensive cost-based algebraic optimization framework for parallel recursive Datalog queries.
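
For readers unfamiliar with the term, the following is a minimal single-machine sketch of semi-naive evaluation, applied to a transitive-closure program. It is an illustration only and is not the authors' Hadoop implementation; the function name, the `edge`/`path` relations, and the in-memory sets are all assumptions made for this example. The idea it shows is that each round joins only the facts that were new in the previous round (the delta) against the base relation, instead of re-joining everything derived so far.

```python
def transitive_closure_semi_naive(edges):
    """Compute reachability pairs with semi-naive evaluation.

    Hypothetical example program (not taken from the paper):
        path(x, y) :- edge(x, y).
        path(x, z) :- path(x, y), edge(y, z).
    """
    edges = set(edges)

    # Index edges by source node so the join on y is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    path = set(edges)    # all facts derived so far
    delta = set(edges)   # facts that were new in the previous round
    rounds = 0
    while delta:
        rounds += 1
        # Join only the delta with edge, not the whole path relation.
        derived = {(x, z) for (x, y) in delta for z in successors.get(y, ())}
        # Keep only genuinely new facts; an empty delta means a fixpoint.
        delta = derived - path
        path |= delta
    return path, rounds


if __name__ == "__main__":
    closure, rounds = transitive_closure_semi_naive([(1, 2), (2, 3), (3, 4)])
    print(sorted(closure), rounds)
```

In a MapReduce setting each round of such a loop would typically correspond to one or more jobs, which is why the size of the delta and the per-job overhead matter; that mapping is a general observation, not a description of the system in this paper.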

Similar resources

Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...

A Semi-clustering Scheme for Large-Scale Graph Analysis on Hadoop

With the evolution of IT technologies, large-scale graph data have lately become a growing interest. As a result, there are a lot of research results in large-scale graph analysis on Hadoop. The graph analysis based on Hadoop provides parallel programming models with data partitioning and contains iterative phases of MapReduce jobs. Therefore, the effectiveness of data partitioning depends on h...

Piranha: Optimizing Short Jobs in Hadoop

Cluster computing has emerged as a key parallel processing platform for large scale data. All major internet companies use it as their major central processing platform. One of cluster computing’s most popular examples is MapReduce and its open source implementation Hadoop. These systems were originally designed for batch and massive-scale computations. Interestingly, over time their production...

Survey on Data Processing and Scheduling in Hadoop

There is an explosion in the volume of data in the world. The amount of data is increasing by leaps and bounds. The sources are individuals, social media, organizations, etc. The data may be structured, semi-structured or unstructured. Gaining knowledge from this data and using it for competitive advantage is the primary focus of all the organizations. In the last few years Big Data has found i...

Concurrent Smart Evaluation of Datalog Queries

A substantial effort has been made in the development of efficient algorithms for (both sequential and parallel) Datalog program evaluation. In this paper we discuss a Dataflow evaluation model and we show how standard algorithms for the bottom-up evaluation of Datalog queries can be significantly improved by means of enhancing the concurrency degree with a concurrent Dataflow evaluation model...

Publication date: 2012
